
    Efficient instruction and data caching for high-performance low-power embedded systems

    Although multi-threading processors can increase the performance of embedded systems with minimal overhead, fetching instructions from multiple threads each cycle also increases the pressure on the instruction cache, potentially harming the performance/consumption ratio. Instruction caches are responsible for a high percentage of the total energy consumption of the chip, which for battery-powered embedded devices becomes a critical issue. A direct way to reduce the energy consumption of the first-level instruction cache is to decrease its size and associativity. However, demanding applications, and especially applications with several threads running together, might suffer a dramatic performance slowdown, or even increase the total energy consumption of the cache hierarchy, due to the extra misses incurred. In this work we introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that replaces the conventional second-level cache (L2) and improves the Energy–Delay of the system. We provide iLP-NUCA with a new tree-based transport network-in-cache that reduces both the cache line service latency and the energy consumption with respect to the former LP-NUCA implementation. We modeled both conventional instruction hierarchies and iLP-NUCAs in our cycle-accurate simulation environment. Our experiments show that, running SPEC CPU2006, iLP-NUCA performs better and consumes less energy than a state-of-the-art high-performance conventional cache hierarchy (three cache levels, dedicated L1 and L2, shared L3). Furthermore, iLP-NUCA reaches, on average, the performance of a conventional instruction cache hierarchy implementing a double-sized L1, independently of the number of threads. This translates into a reduction of the Energy–Delay product of 21%, 18%, and 11%, reaching 90%, 95%, and 99% of the ideal performance for 1, 2, and 4 threads, respectively. These results are consistent across the considered distribution of applications, with the biggest gains in the most demanding applications (those with high instruction cache requirements). Besides, we increase the performance of applications with several threads without being detrimental to any of them. The new transport topology reduces the average service latency of cache lines by 8% and the energy consumption of its components by 20%.
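The comparison above is expressed in terms of the Energy–Delay product (EDP). As a quick, hedged illustration of that metric only, the sketch below computes the EDP of two configurations and the relative reduction; every number in it is a hypothetical placeholder, not a result from the work.

```python
# Minimal sketch of the Energy-Delay Product (EDP) metric used to compare
# cache hierarchies. All numbers below are hypothetical placeholders, not
# values taken from the paper.

def energy_delay_product(energy_joules: float, exec_time_seconds: float) -> float:
    """EDP = total energy consumed * total execution time (lower is better)."""
    return energy_joules * exec_time_seconds

# Hypothetical normalized measurements: a baseline three-level hierarchy
# versus an iLP-NUCA-style hierarchy that saves energy and runs slightly faster.
baseline_edp = energy_delay_product(energy_joules=1.00, exec_time_seconds=1.00)
ilp_nuca_edp = energy_delay_product(energy_joules=0.85, exec_time_seconds=0.93)

reduction = 1.0 - ilp_nuca_edp / baseline_edp
print(f"EDP reduction: {reduction:.1%}")  # ~21% with these made-up inputs
```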

    Coherent vs. non-coherent last level on-chip caches: an evaluation of the latency and capacity trade-offs

    The exorbitant energy consumption of today's data centers and the growing concern for the environment have forced information technologies to consider how to reduce costs while preserving the environment in future data centers. ARM, in a consortium with Nokia, IMEC, EPFL (École Polytechnique Fédérale de Lausanne) and UCY (University of Cyprus), leads the EuroCloud project, which aims to develop a new generation of low-power, 3D-integrated servers-on-chip for cloud computing services. EuroCloud proposes a very low power server-on-chip using ARM processors, hardware accelerators, and 3D-stacked on-chip DRAM. In this project we have studied one of the main components of the EuroCloud chip, the on-chip cache hierarchy, comparing different options for its organization. The configuration of the on-chip cache hierarchy affects the average memory access time and, consequently, influences overall performance. The chip we have studied is composed of two clusters. Each cluster contains two processors with their respective private first-level caches and a slice of the second-level cache (in this case, the second cache level is the last level of the hierarchy). This last-level cache is therefore physically distributed among the clusters and can be configured in different ways; in particular, it admits two organizations: shared caches or private caches. In this project we have analyzed two organizations: a Shared organization, in which the two clusters share the last-level cache, aiming to exploit the effective cache capacity as much as possible, and a Cluster organization, in which the last-level cache is private to each cluster. In the latter case, we prioritize faster access (lower latency) to this level of the hierarchy. Within the Cluster organization, we have also studied the possibility of introducing a coherence mechanism for this level. After an extensive survey of the state of the art and of the chip organization and its architecture, we modeled the two aforementioned designs in our simulation platform and simulated representative workloads. We have analyzed in detail the results obtained for different cache sizes and concluded that a Cluster organization, in general, performs better. A Cluster design benefits from a lower access latency while providing, in most cases, the cache capacity needed to obtain good performance. In cases where capacity is more critical than access latency, or in workloads with little locality, the Shared design outperforms the Cluster design. Regarding coherence mechanisms for this level of the hierarchy, we believe that, for the type of server studied and the type of applications considered, they are unnecessary. Additionally, we have extended the simulation environment used and refined the simulation methodology to obtain more accurate results.
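The latency/capacity trade-off between the two organizations can be pictured with a simple average-memory-access-time (AMAT) model. The sketch below is only a back-of-the-envelope illustration of that trade-off; the latencies and miss rates are made up, not results from the project.

```python
# Back-of-the-envelope AMAT model for the Shared vs. Cluster last-level cache
# (LLC) organizations described above. Every number is a hypothetical
# placeholder chosen only to illustrate the trade-off.

def amat(l1_hit, l1_miss_rate, llc_hit, llc_miss_rate, mem_latency):
    """AMAT in cycles: L1 hit time plus L1 misses served by the LLC or DRAM."""
    return l1_hit + l1_miss_rate * (llc_hit + llc_miss_rate * mem_latency)

# Cluster organization: private LLC slice -> lower hit latency, higher miss rate.
cluster = amat(l1_hit=2, l1_miss_rate=0.05, llc_hit=12, llc_miss_rate=0.15, mem_latency=150)

# Shared organization: whole LLC visible -> higher hit latency, lower miss rate.
shared = amat(l1_hit=2, l1_miss_rate=0.05, llc_hit=20, llc_miss_rate=0.12, mem_latency=150)

print(f"Cluster AMAT: {cluster:.2f} cycles, Shared AMAT: {shared:.2f} cycles")
# With these inputs the Cluster design wins; if the working set overflowed the
# private slice (raising its miss rate), the Shared design would come out ahead,
# matching the capacity-critical cases noted in the abstract.
```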

    Exploiting Natural On-chip Redundancy for Energy Efficient Memory and Computing

    Power density is currently the primary design constraint across most computing segments and the main performance-limiting factor. For years, industry has kept power density constant while increasing frequency and lowering transistor supply (Vdd) and threshold (Vth) voltages. However, Vth scaling has stopped because leakage current is exponentially related to it. Transistor count and integration density keep doubling every process generation (Moore's Law), but the power budget caps the amount of hardware that can be active at the same time, leading to dark silicon. With each new generation there are more resources available, but we cannot fully exploit their performance potential. In recent years, different research trends have explored how to cope with dark silicon and unlock the energy efficiency of the chips, including Near-Threshold voltage Computing (NTC) and approximate computing. NTC aggressively lowers Vdd to values near Vth. This allows a substantial reduction in power, as dynamic power scales quadratically with supply voltage. The resultant power reduction could be used to activate more chip resources and potentially achieve performance improvements. Unfortunately, Vdd scaling is limited by the tight functionality margins of on-chip SRAM transistors. When scaling Vdd down to near-threshold values, manufacture-induced parameter variations affect the functionality of SRAM cells, which eventually become unreliable. A large number of emerging applications, on the other hand, feature an intrinsic error-resilience property, tolerating a certain amount of noise. In this context, approximate computing takes advantage of this observation and exploits the gap between the level of accuracy required by the application and the level of accuracy given by the computation, provided that reducing the accuracy translates into an energy gain. However, deciding which instructions and data, and which techniques, are best suited for approximation still poses a major challenge. This dissertation contributes in these two directions. First, it proposes a new approach to mitigate the impact of SRAM failures due to parameter variation for effective operation at ultra-low voltages. We identify two levels of natural on-chip redundancy: cache level and content level. The first arises because of the replication of blocks in multi-level cache hierarchies. We exploit this redundancy with a cache management policy that allocates blocks to entries taking into account the nature of the cache entry and the use pattern of the block. This policy obtains performance improvements between 2% and 34% with respect to block disabling, a technique with similar complexity, while incurring no additional storage overhead. The second (content-level redundancy) arises because of the redundancy of data in real-world applications. We exploit this redundancy by compressing cache blocks to fit them in partially functional cache entries. At the cost of a slight overhead increase, we can obtain performance within 2% of that obtained when the cache is built with fault-free cells, even if more than 90% of the cache entries have at least one faulty cell. Then, we analyze how the intrinsic noise tolerance of emerging applications can be exploited to design an approximate Instruction Set Architecture (ISA). Exploiting the ISA redundancy, we explore a set of techniques to approximate the execution of instructions across a set of emerging applications, pointing out the potential of reducing the complexity of the ISA and the trade-offs of the approach. In a proof-of-concept implementation, the ISA is shrunk in two dimensions: Breadth (i.e., simplifying instructions) and Depth (i.e., dropping instructions). This proof-of-concept shows that energy can be reduced by 20.6% on average at around 14.9% accuracy loss.
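The content-level redundancy idea can be sketched as a simple fit test: a block is placed in a partially faulty entry only if its compressed form fits in the entry's functional bytes. The code below is an illustrative sketch, not the dissertation's mechanism; zlib stands in for whatever hardware-friendly compression scheme is actually used, and all sizes and names are assumptions.

```python
# Sketch of the content-level redundancy idea: store a compressed block in a
# cache entry that has faulty cells, as long as the compressed data fits in
# the entry's functional bytes. zlib is only a software stand-in for a
# hardware-friendly compressor; all sizes/names here are illustrative.
import zlib

BLOCK_SIZE = 64  # bytes per cache block (typical value, assumed)

def fits_in_faulty_entry(block: bytes, functional_bytes: int) -> bool:
    """Return True if the block can live in an entry with `functional_bytes`
    working bytes, either uncompressed or after compression."""
    if functional_bytes >= BLOCK_SIZE:
        return True                                 # fault-free entry: store as-is
    compressed = zlib.compress(block)
    return len(compressed) <= functional_bytes      # rely on content redundancy

# Real-world data is often highly redundant (e.g. runs of zeros), so it tends
# to compress well enough to fit in a partially functional entry.
redundant_block = bytes(BLOCK_SIZE)                 # a block full of zeros
print(fits_in_faulty_entry(redundant_block, functional_bytes=48))   # True
incompressible_block = bytes(range(BLOCK_SIZE))
print(fits_in_faulty_entry(incompressible_block, functional_bytes=48))  # likely False
```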